Take-home Exercise 3

Putting Visual Analytics into Practical Use.

Gong Shufen https://www.linkedin.com/public-profile/settings?trk=d_flagship3_profile_self_view_public_profile (SMU MITB(AT))https://scis.smu.edu.sg/master-it-business/analytics-track/curriculum
2022-02-20

1. Overview

In this take-home exercise, the task will be to create a data visualisation to segment kid drinks and other by nutrition indicators. For the purpose of this task, starbucks_drink.csv will be used.

2. Proposed Design

After observering the data set, because we don’t need to consider the relationship between different nutrition indicators, therefore I will use the heatmap to visualize the correspongding level of different nutrition indicators in different types of drinks which includes in the segment kid drinks and other.

Heatmaps visualise data through variations in colouring. When applied to a tabular format, heatmaps are useful for cross-examining multivariate data, through placing variables in the columns and observation (or records) in rowa and colouring the cells within the table. Heatmaps are good for showing variance across multiple variables, revealing any patterns, displaying whether any variables are similar to each other, and for detecting if any correlations exist in-between them.

3. Installing and loading the required libraries

Before we get started, it is important for us to ensure that the required R packages have been installed. For the purpose of the exercise, the follow packages will be used:

packages = c('seriation', 'dendextend', 'heatmaply', 'tidyverse')

for(p in packages){library
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

4. Data Import

The code chunk below imports starbucks_drink.csv into R environment by using read_csv() function of readr package in tidyverse family.

sd <- read_csv("data/starbucks_drink.csv")

5. Data Wrangling

Because we only need the cat the category of “kids drinks and other”. So the following code chunk will be used to filter this category by filter() function.

kids_sd <- sd %>%
  filter(Category == "kids-drinks-and-other")

After reviewing the data we have now, the Caffeine(mg) is classified as it is classified as a character rather than a numerical value instead which is wrong. Hence, I will use the following code chunk to convert it.

kids_sd$`Caffeine(mg)` <- parse_number(kids_sd$`Caffeine(mg)`)

Next, there is some missing value which is NA in the column of “Milk” and “Whipped Cream”. So it should be replace by “No Milk” and “No Whipped Cream” as following code chunk.

kids_sd[is.na(kids_sd)] <- "NA"
kids_sd$Milk[kids_sd$Milk == "NA"] <- "No Milk"
kids_sd$`Whipped Cream`[kids_sd$`Whipped Cream` == "NA"] <- "No Whipped Cream"

Then we check the missing value again to ensure that there is no missing value.

table(is.na(kids_sd))

FALSE 
 4716 

Besides, when go through the column value of Caffeine, there is an outlier which is 225mg and 24 fl oz whose size is short, it is not reasonable and might be a mistake. So it should be removed from data set.

In the data set, the Name, Milk and Whipped Cream are all chr and I decided to combine them as drink name using following code chunk.

kids_sd$Name = paste(kids_sd$Name,kids_sd$Milk, kids_sd$`Whipped Cream`)

For the Portion (fl oz) will affect the level of nutritional indicators, I will use the nutritional indicators value divided by the Portion (fl oz) mean to solve this influence.

kids_goruped <- kids_sd %>%
  group_by(`Name`) %>%
  summarise('Calories' = sum(`Calories`)/sum(`Portion(fl oz)`),
           'Calories from fat'  = sum(`Calories from fat`)/sum(`Portion(fl oz)`),
           'Total Fat(g)' = sum(`Total Fat(g)`)/sum(`Portion(fl oz)`),
           'Saturated fat(g)' = sum(`Saturated fat(g)`)/sum(`Portion(fl oz)`),
           'Trans fat(g)' = sum(`Trans fat(g)`)/sum(`Portion(fl oz)`),
           'Cholesterol(mg)' = sum(`Cholesterol(mg)`)/sum(`Portion(fl oz)`),
           'Sodium(mg)' = sum(`Sodium(mg)`)/sum(`Portion(fl oz)`),
           'Total Carbohydrate(g)' = sum(`Total Carbohydrate(g)`)/sum(`Portion(fl oz)`),
           'Dietary Fiber(g)' = sum(`Dietary Fiber(g)`)/sum(`Portion(fl oz)`),
           'Sugars(g)' = sum(`Sugars(g)`)/sum(`Portion(fl oz)`),
           'Protein(g)' = sum(`Protein(g)`)/sum(`Portion(fl oz)`),
           'Caffeine(mg)' = sum(`Caffeine(mg)`)/sum(`Portion(fl oz)`)) %>%
  ungroup()

After that, we need to set the drink names as the row number and convert to a data matrix to build a heat map as following code chunk.

row.names(kids_goruped) <- kids_goruped$Name
kids_gp_matrix <- data.matrix(kids_goruped)

6. Heatmap visualization

In the first step, we need to find out the best clustering method and number of cluster using the dend_expend() and find_k() functions.

kids_gp_mx <- dist(normalize(kids_gp_matrix[, -c(1)]), method = "euclidean")
dend_expend(kids_gp_mx)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.5614832
2      unknown        ward.D2 0.6088735
3      unknown         single 0.6646756
4      unknown       complete 0.6243221
5      unknown        average 0.7387914
6      unknown       mcquitty 0.6958625
7      unknown         median 0.5369151
8      unknown       centroid 0.6061457

According to the result above, the ‘average’ method should be used hence it has the highest optimum value.

Then for the best number of clusters, the find_k() function will be used.

kids_cluster <- hclust(kids_gp_mx, method = "average")
kids_k <- find_k(kids_cluster)
plot(kids_k)

As the graph above, the number of clusters should be 10, therefore the ‘average’ method and 10 clusters will be used to plot the heatmap as following.

heatmaply(normalize(kids_gp_matrix[,-c(1)]),
          dist_method = "euclidean",
          hclust_method = "average",
          seriate = "none",
          k_row = 10,
          colors = Blues,
          margins = c(NA,70,60,NA),
          fontsize_row = 5,
          fontsize_col = 7,
          xlab = "Nutritional Indicators",
          ylab = "Drink Types",
          main = "Starbucks Kids and Other Drinks’ Segment \nBy Nutrition Indicator",
          Colv = NA ,
          grid_color = "white",
          grid_gap = 0.1
          )

7. Insights

8. Reference

Building Heatmap with R